Running a Quantum Circuit at the Speed of Data 

Nemanja Isailovic, Mark Whitney, Yatish Patel and John Kubiatowicz 

Computer Science Division 

University of California, Berkeley 

{nemanja, whitney, yatish, kubitron}@cs.berkeley.edu






Abstract 

We analyze circuits for a number of kernels from popu- 
lar quantum computing applications, characterizing the 
hardware resources necessary to take ancilla preparation 
off the critical path. The result is a chip entirely domi- 
nated by ancilla generation circuits. To address this is- 
sue, we introduce optimized ancilla factories and analyze 
their structure and physical layout for ion trap technol- 
ogy. We introduce a new quantum computing architecture 
with highly concentrated data-only regions surrounded by 
shared ancilla factories. The results are a reduced depen- 
dence on costly teleportation, more efficient distribution 
of generated ancillae and more than five times speedup 
over previous proposals. 

1 Introduction 

Quantum computing shows great potential to speed up 
difficult applications such as factorization [1] and quantum
mechanical simulation [2]. Unfortunately, quantum
states are so fragile that all quantum bits, or qubits, in
the system must be encoded for redundancy and remain
encoded during computation. Various encoding methodologies
have been proposed [3, 4], ranging from several
to several dozen physical qubits used to represent a single 
encoded qubit to be used in the high-level computation. 

It is expected that an encoded qubit will need to un- 
dergo a Quantum Error Correction (QEC) step after each 
"useful" basic gate is performed upon it. However, the 
bulk of a QEC operation is a preparation circuit involv- 
ing the creation of encoded ancillary qubits, or ancillae, 
which does not involve the data qubit to be corrected. 
Consequently, as Chi et al. point out in [5], the critical
path of a quantum circuit could be significantly reduced 
if the ancilla preparation work were done in parallel with 
useful computation. In particular, the speed of a quantum 
computation would be limited solely by data dependen- 
cies between encoded qubits. We refer to this fully offline 
parallelization of data-independent work as running the 
circuit at the speed of data. 
Figure 1: (a) Standard implementation of a circuit involving
qubits Q0, Q1 and Q2. Only the grey blocks represent
interactions with actual data. The bulk of the critical path involves
independent ancilla preparation. (b) An optimized version of
the circuit in which ancilla preparation is pulled off the critical
path through use of increased hardware. Here, the speed of the
computation is limited only by data dependencies (grey blocks).



Figure 1a shows a possible execution of a simple series
of quantum gates involving qubits Q0, Q1 and Q2.
Each gate involves some encoded ancilla preparation for
the QEC step which must follow it. In addition, some
gates, called non-transversal gates, require further encoded
ancilla preparation simply to be performed (elaborated
upon in Section 2). Figure 1b shows these operations
performed at the speed of data. Chi et al. suggest
that these ancilla preparation operations could be done
in advance, but the hardware cost for this parallelization
grows quickly as the critical path is shortened.

In Section 2 we investigate quantum circuits for encoded
ancilla preparation and evaluate them in terms of
error and complexity. In Section 3 we identify three common
subcircuits of larger quantum algorithms and evaluate
their characteristics concerning encoded ancilla needs
for both QEC and non-transversal gates. In Section 4 we
detail the layout and throughput of a pipelined ancilla factory
specialized for generating encoded ancilla qubits. In
Section 5 we combine our analyses to answer the overall
question of the feasibility of running a quantum circuit at
the speed of data, and we conclude in Section 6.



2 Ancilla Preparation Circuits 

Typical quantum circuits require many encoded ancilla 
qubits. In this section, we discuss several ancilla prepara- 
tion circuits and evaluate them in terms of complexity and 
error. Ultimately, we select encoding circuits that will be 
used in our layouts in Section 4.
Figure 2: A quantum error correcting (QEC) operation is composed
of a bit-flip correction and a phase-flip correction, corresponding
to the two types of errors that can happen to a qubit.
The thick bars represent encoded qubits.



2.1 Computing on Encoded Data Bits

Since quantum data is very fragile, it must be encoded at
all times in an appropriate quantum error correction code.
A high-level view of the procedure for error-correcting an
encoded data qubit is shown in Figure 2. Both the bit
value and phase must be repaired during the QEC step [6].
Two sets of physical ancilla qubits are each encoded into
the zero state and then consumed during correction.

Gates applied to encoded data may be classified into
two types: transversal and non-transversal. A transversal
encoded gate is applied by performing the corresponding
physical gate independently on each of the qubits comprising
the encoded qubit, as shown in Figure 3a for the
Hadamard gate. A non-transversal encoded gate is decomposed
into a more complex set of physical operations,
including multi-qubit physical operations between physical
qubits within the same encoded qubit; for example,
see the Basic Encoded Zero Ancilla Prepare in Figure 3b.
Since errors are propagated between physical qubits during
the application of non-transversal gates, such gates
must be designed carefully to avoid introducing uncorrectable
errors.

A class of quantum codes known as CSS codes [7, 3]
allows transversal implementations of most encoded gates.
For this reason, CSS codes are used in most analyses of
the fault tolerance of quantum circuits. For the rest of this
paper, we use the [[7,1,3]] CSS code [7]. Encoded gates
that can be performed transversally on this code include
the two-qubit CX, as well as the one-qubit X, Y, Z, Phase,
and Hadamard gates. In order to have a universal gate set,
we also need the non-transversal π/8 gate and the encoding
procedure to create an encoded ancilla. We will discuss
how to obtain a fault tolerant version of the π/8 gate
later in this section, but first we investigate the problem of
getting a fault tolerant encoding procedure.



2.2 Circuit Evaluation Methodology

Since encoded ancillae are a major component of error
correction, it is critical to generate clean ancillae to avoid
introducing errors during the correcting process. In the
following, we will evaluate circuits by using the tools in
[8], which allow us to lay out circuits. The effects of error
are then modeled by Monte Carlo simulation where
errors can be introduced at any gate or qubit movement
operation. Additionally, we model the fact that two-qubit
gates propagate bit and phase flips between qubits. This
simulation is similar to what was done in [4] except with
the addition of qubit movement error from our detailed
layout. We assume an independent error probability for
each gate and movement operation. The gate error rate is
10^-4 and the error per movement operation is 10^-6. Our gate
and movement error rates are consistent with [9].
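The Monte Carlo methodology of Section 2.2 can be illustrated with a minimal Python sketch. It is not the tool of [8]: the operation format, the helper names (`propagate_cx`, `inject`, `run_once`, `estimate_failure`) and the choice to draw one independent error per qubit touched by an operation are all assumptions made for illustration. Only the two error rates and the CX propagation rule (bit flips travel from control to target, phase flips from target to control) come from the text.

```python
import random

GATE_ERROR = 1e-4   # per-gate error probability quoted in the text
MOVE_ERROR = 1e-6   # per-movement error probability quoted in the text

def propagate_cx(x_err, z_err, control, target):
    """CX copies bit flips control->target and phase flips target->control."""
    x_err[target] ^= x_err[control]
    z_err[control] ^= z_err[target]

def inject(x_err, z_err, qubit, rng):
    """Apply a random non-identity Pauli (X, Y or Z) to one qubit."""
    pauli = rng.choice("XYZ")
    if pauli in "XY":
        x_err[qubit] ^= 1
    if pauli in "YZ":
        z_err[qubit] ^= 1

def run_once(ops, n_qubits, rng, p_gate=GATE_ERROR, p_move=MOVE_ERROR):
    """Run one trial. Each op is a tuple such as ("cx", c, t) or ("move", q)."""
    x_err = [0] * n_qubits
    z_err = [0] * n_qubits
    for op in ops:
        kind, qubits = op[0], op[1:]
        if kind == "cx":
            propagate_cx(x_err, z_err, *qubits)
        p = p_move if kind == "move" else p_gate
        for q in qubits:  # assumption: one independent draw per touched qubit
            if rng.random() < p:
                inject(x_err, z_err, q, rng)
    return x_err, z_err

def estimate_failure(ops, n_qubits, trials, is_failure, seed=0, **probs):
    """Fraction of trials whose final error pattern satisfies is_failure."""
    rng = random.Random(seed)
    failures = sum(
        bool(is_failure(*run_once(ops, n_qubits, rng, **probs)))
        for _ in range(trials)
    )
    return failures / trials

if __name__ == "__main__":
    # Hypothetical 3-qubit fragment, not a circuit from the paper.
    ops = [("h", 0), ("cx", 0, 1), ("move", 1), ("cx", 1, 2)]
    # Crude failure test: more than one bit flip or more than one phase flip.
    too_many = lambda x, z: sum(x) > 1 or sum(z) > 1
    print(estimate_failure(ops, 3, 10000, too_many, p_gate=0.05, p_move=0.0))
```

The failure predicate here only counts error weight; a faithful evaluation of a [[7,1,3]] block would instead decode the error pattern against the code's stabilizers.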



2.3 Encoded Ancilla Preparation 

Since the Bit Correct and Phase Correct circuits in Figure 2
are fully transversal (each consisting of a transversal
CX, measure and conditional correct [10]), we focus
on the basic zero ancilla preparation circuit, shown in
Figure 3b. The probability of an uncorrectable error in the
resulting encoded output of this circuit is 1.8 × 10^-3 based
on our evaluation methodology above. We would like to
improve on this basic result.

There are two different circuit-level techniques for re- 
moving general errors from an encoded qubit: verifica- 
tion and correction. Verification tests a qubit in a known 
state for error and discards it if too much error is found. 
Correction is more complex, but it corrects a bit or phase 
error from an encoded qubit in an unknown state, thus it 
is more suitable for data qubits in a long-running compu- 
tation. Encoded zero ancillae are in known state and may 
be discarded if necessary, so either method is suitable. 

While Figure 3b shows the circuit for preparing an encoded
ancilla in the zero state in the [[7,1,3]] CSS code,
we would like a more error-free ancilla qubit for interaction
with data. Figure 4 shows some example zero ancilla
preparation circuits from the literature [11, 10], with the
overall error rate for each given under the circuit. Correction
alone (Figure 4b) loses to verification alone
(Figure 4a) in both error and area. When comparing
Figures 4a and 4c, it is important to note that they are not
to scale. The "Basic 0" module (expanded in Figure 3b)
is by far the most complex, so by doing both verification
and correction, we get more than an order of magnitude
improvement in error over verification alone for slightly
more than three times the area. Thus, we shall use the
circuit in Figure 4c in this paper.

Since we are using qubit verification as part of our encoded
zero preparation, we need to know the success rate
of verification. Using the same Monte Carlo simulation
used for error probability calculations, we estimate the
verification failure rate of the subunit in Figure 4a to be 0.2%. We
will use this in calculations later in Section 4.4.

Figure 3: (a) A transversal encoded gate involves transversal application of physical gates. (b) A non-transversal encoded gate
involves multi-qubit physical operations between physical qubits within the same encoded qubit.

Figure 4: Different circuits for the "High-Fidelity Encoded Zero Ancilla Prepare" in Figure 2. Each "Basic 0" module corresponds
to the circuit in Figure 3b. Each "Cat Prep" module corresponds to the preparation of a special 3-qubit state. Thick bars are encoded
qubits (seven physical qubits). The overall error rate of each is given under each circuit.



2.4 Fault Tolerant π/8 Gate

It has been shown that no quantum error correcting code
has transversal gate implementations for all the gates in a
universal set [12], and indeed, in the [[7,1,3]] CSS code,
we need the non-transversal π/8 gate in order to complete
the universal set. In order to maintain fault tolerance
when performing the π/8 gate on a [[7,1,3]] encoded
qubit, we use a technique developed in [13]. Their approach
is to generate an encoded ancilla qubit encoded in
the π/8 state and perform transversal interactions with the
data, as shown in Figure 5a, to achieve the overall effect
of an encoded π/8 gate.

To encode the π/8 ancilla qubit, we could try to create
a physical π/8 ancilla qubit and then use the encoding
circuit in Figure 3b, but this would result in errors on the
original physical qubit propagating to each physical qubit
in the final encoded ancilla, which is unacceptable. Thus,
we require the far more complicated circuit shown in
Figure 5b, which consists of an encoded zero ancilla prepare,
a 7-qubit cat state prepare (where a cat state is a specially
prepared multi-qubit state) and a series of transversal
encoded gates.

2.5 Fault Tolerant π/2^k Gates

The Quantum Fourier Transform (QFT) requires controlled
phase rotation gates by small angles (these gates
replace the explicit tracking of roots of unity in the classical
FFT algorithm). The amount of precision for these
gates scales exponentially in the number of bits involved
in the QFT [6]. A controlled phase rotation by π/2^k can
be generated by a CX gate and 3 single qubit π/2^(k+1) gates
[14]. Thus, using circuit techniques mentioned so far, we
can implement every gate in the QFT fault tolerantly except
these single qubit rotation gates. There are two problems
with implementing an arbitrary precision phase rotation
fault tolerantly:

• For angles smaller than π/2, there is no transversal
gate implementation using the [[7,1,3]] code [12]. In
fact, this seems likely to be true for all codes.

• Such a gate would require the physical implemen- 
tation of an arbitrary precision rotation - a difficult 
burden on the engineers of these devices. 

Due to the above reasons, we adopt a technique by Fowler
[14]. To approximate small angle rotations, we exhaustively
search all permutations of T and H gates to find a
minimum length sequence for a π/2^k rotation gate up to
an acceptable error.

Figure 5: (a) Applying an encoded π/8 gate on an encoded data qubit involves creating an encoded π/8 ancilla and performing
some transversal gates. (b) Creating the encoded π/8 ancilla used in the circuit in (a) requires an encoded zero ancilla, a 7-qubit
cat state (a specially prepared qubit set) and a series of transversal gates. Note that the π/8 gate near the far right is transversal but
does not implement an encoded π/8 gate.

Figure 6: Fault tolerant π/2^k gates can be performed recursively
with a cascade of π/2^i (i = 3...k) ancilla factories and k − 2 CX
and X gates. Each measure gate output controls both the single
qubit X gate and the compound gate involving more ancilla
factories. Each measurement has an equal chance of giving the
"correct" state, in which the remaining circuit is skipped, or a
"wrong" state, in which a larger rotation has to be done to adjust
the state. The actual output data from the circuit connects to the
first quantum bitline associated with a correct measurement.

We also note that if a π/2^k physical gate is available
in a given technology, an exact fault-tolerant π/2^k can be
implemented as shown in Figure 6. In order to be conservative
about the availability of arbitrary precision rotation
gates, we do not use this construction in the circuits in this
paper. However, in Section 4.4.2 we briefly analyze the
performance advantages of this technique.
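The exhaustive search over T and H sequences described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of [14]: the function names, the brute-force enumeration and the particular global-phase-insensitive distance are choices made here, and T denotes the single-qubit gate diag(1, e^(iπ/4)).

```python
import cmath
import itertools
import math

S2 = 1 / math.sqrt(2)
H = ((S2, S2), (S2, -S2))
T = ((1, 0), (0, cmath.exp(1j * math.pi / 4)))
I2 = ((1, 0), (0, 1))
GATES = {"H": H, "T": T}

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def phase_gate(angle):
    """Single-qubit phase rotation diag(1, exp(i * angle))."""
    return ((1, 0), (0, cmath.exp(1j * angle)))

def distance(u, v):
    """Global-phase-insensitive distance sqrt((2 - |tr(u^dagger v)|) / 2)."""
    tr = sum(complex(u[k][i]).conjugate() * v[k][i]
             for i in range(2) for k in range(2))
    return math.sqrt(max(0.0, (2 - abs(tr)) / 2))

def best_sequence(target, max_len):
    """Shortest H/T sequence (in time order) closest to target."""
    best = ("", distance(target, I2))
    for n in range(1, max_len + 1):
        for seq in itertools.product("HT", repeat=n):
            u = I2
            for g in seq:
                u = matmul(GATES[g], u)  # later gates multiply on the left
            d = distance(target, u)
            if d < best[1] - 1e-6:  # accept only a real improvement
                best = ("".join(seq), d)
    return best

if __name__ == "__main__":
    seq, err = best_sequence(phase_gate(math.pi / 16), 8)
    print(len(seq), seq, err)
```

The search space doubles with every added gate, so a brute-force loop like this is only practical for short sequences. In the encoded setting each T in the result costs one π/8 ancilla, while each H is transversal.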
Table 1: The latency values for various physical operations in
ion trap technology [9, 15, 16]. Since these change as more
experiments are done, we show many of our results in a symbolic
fashion before plugging in these values.



3 Circuit Characteristics

We now characterize the runtime properties of some commonly
used quantum circuits, focusing on the impact of
encoded ancilla generation. Many quantum algorithms
require ancillae to assist in computation. For example,
an n-bit Quantum Ripple-Carry Adder uses two n-bit data
inputs plus n + 1 ancillae. In addition to this, shorter-lived
ancillae are needed for QEC and for performing non-transversal
encoded gates, as discussed earlier.

Throughout this paper we refer to the longer-lived ancillae
used in the main computation as "data ancillae" and
to the shorter-lived ones as "ancillae." We make this distinction
because data ancillae tend to have long enough
lifespans that "discarding" them and restarting their portion
of the computation has a relatively high cost. Our
work focuses on the short-lived ancillae which need to be
produced in large quantities and which may more easily
be discarded and re-encoded.

We do most of our analysis in a symbolic fashion so that
it may be applied to varying technologies and assumptions.
However, we will also be applying the analysis to
a specific technology, trapped ions [17], in order to make
the results of our calculations more concrete. We use the
physical gate latencies shown in Table 1, the [[7,1,3]] CSS
code introduced in Section 2.1 and the encoded ancilla
preparation circuits shown in Figures 4c and 5b. Note that
the "Zero Prepare" in Table 1 refers to a physical zero
prepare, which is the leftmost set of gates in the Basic
Encoded Zero Ancilla Prepare in Figure 3b.



3.1 Benchmarks 

For our benchmarks, we use the 32-bit Quantum Ripple- 
Carry Adder (QRCA) circuit from fl]D, the 32-bit Quan- 
tum Carry-Lookahead Adder (QCLA) circuit from lfl9l 
and a 32-bit Quantum Fourier Transform (QFT) circuit 
we derived using methodology described in Section 12.51 
All three are core kernels of a varied array of quantum 
algorithms, including Shor's factorization algorithm. 



3.2 QEC Circuit Characteristics 

We study our benchmark circuits at two extremes of the
latency-area trade-off: 1) no overlap of QEC and computation
(high latency, but low area), and 2) infinitely fast
encoded ancilla production, resulting in an execution limited
only by data dependencies (low latency, but potentially
much higher area for encoded ancilla generation).

Circuit        Data Op Latency (μs)   Data QEC Interact Latency (μs)   Ancilla Prep Latency (μs)
               (% of total)           (% of total)                     (% of total)
32-Bit QRCA    29508 (5.2%)           95641 (16.7%)                    447726 (78.2%)
32-Bit QCLA    3827 (5.3%)            11921 (16.7%)                    55806 (78.0%)
32-Bit QFT     77057 (5.0%)           365792 (23.7%)                   1097376 (71.2%)

Table 2: Relative latency of useful data operations, interaction of data with encoded ancillae for QEC and encoded ancilla preparation
for QEC for various circuits, assuming no overlap between them.

[Figure 7, QRCA panel, x-axis: Progress Through Execution of 32-Bit QRCA (μs).]

Figure 7: Encoded zero ancilla needs for the QRCA (left), QCLA (middle) and QFT (right) to run at the speed of data.

               Bandwidth Needed
Circuit        For QEC    For π/8 Gates
32-Bit QRCA    34.8       7.0
32-Bit QCLA    306.1      62.7
32-Bit QFT     36.8       8.6

Table 3: Average encoded ancilla bandwidths needed for QEC
and non-transversal gates (in encoded ancillae per millisecond)
if each circuit is to be executed at the speed of data.

Table 2 shows for each benchmark the latency of the
critical path in the absence of movement (Column 2),
as well as latencies for the data-dependent and data-independent
(Columns 3 and 4) portions of QEC steps,
assuming a QEC operation must be performed after each
useful gate. The minimal running time is the sum of
Columns 2 and 3, since these involve data qubits. Column 4
corresponds to encoded ancilla generation time.
Clearly, there is much to be gained in overall execution
time by taking ancilla preparation off the critical path.
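The potential gain can be made concrete with a short calculation on the Table 2 figures. The Python sketch below simply sums the three columns to get the fully serial latency, sums Columns 2 and 3 to get the minimal running time, and takes the ratio. The function name and dictionary layout are illustrative choices, and the ratio ignores movement, as Table 2 does.

```python
# Latencies in microseconds, copied from Table 2:
# (data operations, data/QEC interaction, ancilla preparation)
TABLE_2_US = {
    "32-Bit QRCA": (29508, 95641, 447726),
    "32-Bit QCLA": (3827, 11921, 55806),
    "32-Bit QFT": (77057, 365792, 1097376),
}

def speed_of_data_summary(data_op, qec_interact, ancilla_prep):
    """Return (serial latency, minimal latency, ratio between them)."""
    serial = data_op + qec_interact + ancilla_prep  # no overlap at all
    minimal = data_op + qec_interact                # ancilla prep fully hidden
    return serial, minimal, serial / minimal

if __name__ == "__main__":
    for name, row in TABLE_2_US.items():
        serial, minimal, ratio = speed_of_data_summary(*row)
        print(f"{name}: {serial} us serial, {minimal} us minimal, {ratio:.2f}x")
```

Under these numbers the adders would run roughly 4.5 times faster and the QFT roughly 3.5 times faster if ancilla preparation were hidden entirely.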

Figure 7 shows for the QRCA (left), QCLA (middle)
and QFT (right) the number of encoded ancillae used for
QEC which need to be in the system as execution progresses
in order to keep the circuit operating at the speed
of data. This means that adequate hardware resources exist
to generate and distribute the needed ancillae in time,
but the interaction with data during each QEC step is still
on the critical path of execution. Table 3 summarizes this
figure by giving the average encoded ancilla bandwidth
necessary for each.



These averages do not take into account the handling
of peak periods. In reality, the encoded ancilla bandwidth
necessary to run a circuit optimally may be higher than the
average bandwidth. Figure 8 shows for the QRCA (left),
QCLA (middle) and QFT (right) the circuit execution time
assuming a steady throughput of encoded ancillae being
generated, as specified on the x-axis. These graphs show
us the sustained ancilla bandwidth necessary to run each
circuit at near-optimal speed, but these are only estimates
since they lack the details of movement and layout. In
Section 4 we examine the associated hardware needs.
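Why the average bandwidth can fall short during peak periods is easy to see in a toy model. The Python sketch below is not the simulation behind Figure 8: it assumes a purely sequential list of steps, a factory that produces ancillae at a perfectly steady rate from time zero, unlimited buffering, and a step that cannot start until all ancillae it consumes have been produced. The schedules are hypothetical.

```python
def execution_time_us(steps, rate_per_ms):
    """Total time for sequential steps given a steady ancilla supply.

    steps: list of (duration_us, ancillae_needed) tuples, executed in order.
    rate_per_ms: encoded ancillae produced per millisecond.
    """
    now = 0.0
    consumed = 0
    for duration_us, ancillae_needed in steps:
        consumed += ancillae_needed
        # Earliest time at which `consumed` ancillae have been produced.
        ready_at = consumed * 1000.0 / rate_per_ms
        now = max(now, ready_at) + duration_us
    return now

if __name__ == "__main__":
    # Two hypothetical schedules with the same total demand (200 ancillae)
    # and the same data-limited running time (1000 us).
    steady = [(10.0, 2)] * 100
    bursty = [(10.0, 4)] * 50 + [(10.0, 0)] * 50
    for rate in (100.0, 200.0, 400.0, float("inf")):
        print(rate, execution_time_us(steady, rate), execution_time_us(bursty, rate))
```

Both schedules have an average demand of 200 ancillae per millisecond, yet in this model only the steady one finishes near the data-limited time at that rate; the bursty one needs a higher sustained bandwidth.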

3.3 Non-Transversal One-Qubit Gates

The encoded ancilla bandwidth needs discussed in Section 3.2
for our three benchmarks include only zero ancillae
needed for error correction. Non-transversal one-qubit
gates account for 40.5%, 41.0% and 46.9% of our
QRCA, QCLA and QFT benchmark circuits, respectively,
when using the [[7,1,3]] encoding. As explained
in Section 2.4, the execution of a non-transversal encoded
gate is performed with the use of a π/8 encoded ancilla
qubit. Column 3 in Table 3 shows the corresponding π/8
ancilla bandwidth needed for each benchmark to achieve
a runtime limited only by the speed of data (the sum of
Columns 2 and 3 in Table 2).



4 Ancilla Factory Layout 

In this section, we shall explore the design space of possible
ancilla factories and determine the hardware resources
necessary to produce encoded ancillae at the bandwidths
calculated in Sections 3.2 and 3.3 in order to take ancilla
generation off the critical path of execution.

Figure 8: The execution time of the QRCA (left), QCLA (middle) and QFT (right) as a function of a steady throughput of encoded
zero ancillae. The vertical line in each shows the average bandwidth for that circuit from Table 3.

Physical Operation   Latency Symbol   Latency (μs)
Straight Move        t_move           1
Turn                 t_turn           10

Table 4: Latency values for the two types of move operations
in ion trap technology [9, 15, 16]. A Straight Move is across a
single macroblock (Figure 9).

Figure 9: The abstract building blocks of our layouts. Black
boxes are gate locations (which may not occur in an intersection),
grey boxes are abstract "electrodes," and wide white channels
are valid paths for qubit movement.

Figure 10: Layout of a single encoded data qubit.

4.1 Ion Trap Abstraction 

Our area calculations are done using an abstraction of ion
trap technology [17], described here.

Qubits: A single qubit capable of holding one bit of 
quantum state is an ion. The physical implementation of a 
qubit is actually more complicated, but for our purposes, 
we may represent each qubit as a single ion. 

Movement: Electrodes are used to create potential 
wells in which qubits (ions) are trapped. Potential wells 
and the ions within are moved via an application of precise 
pulse sequences to the electrodes. Moving an ion around 
a corner takes more time than moving straight [20]. The
latency numbers we use are shown in Table 4.

Gates: A gate is performed by firing precise laser pulses 
at a trapped ion. We may abstract away the physics and 
consider that a gate is performed by arrival at certain spe- 
cial "gate locations" in the layout. 



Macroblocks: Since qubit movement is performed by 
electrodes whose position is fixed at fab time, certain 
"channels" for qubit movement are also set at fab time. 
The details of electrode structure are still evolving, so de- 
termining area in terms of number of ion traps is a bit am- 
biguous. For this reason, we use the macroblocks shown 
in Figure 9 as the basic building blocks of our layouts.
Each macroblock has one or more "ports" through which 
qubits may enter and exit and which connect to an ad- 
jacent macroblock. To perform a gate operation, all in- 
volved qubits must enter a valid gate location (a black 
square in our macroblocks) and remain there for the du- 
ration of the gate. Our area numbers are all calculated in 
terms of macroblock count. 



4.2 Data Qubit Area 

Over the run of a quantum circuit, encoded data must per- 
form four distinct types of operations: transversal one- 
qubit gates, non-transversal one-qubit gates, transversal 
two-qubit gates and QEC steps. As described in Sec- 
tion 12.41 a non-transversal one-qubit gate may be per- 
formed by preparing a specific encoded ancilla and inter- 
acting it transversally with the data qubit. Likewise, the 
data/ancilla interaction portion of a QEC step involves a 
transversal two-qubit gate. In the end, the main opera- 
tions the encoded data must support are transversal one- 
and two-qubit gates. 

To support these major operations, we use single compute
regions as shown in Figure 10. The design consists
of a single column of Straight Channel Gate Macroblocks
with enough room for a single encoded qubit
(seven macroblocks for the [[7,1,3]] CSS code), with access
on either side to whatever interconnect network is being
used. Thus, if we are encoding each qubit into m physical
qubits, the total area used by data is m × n_q, where n_q
is the total number of data qubits (including data ancillae)
in the circuit.

[Figure 11 row labels: For Bit Correction; Encoded Ancilla for Use with Data; For Phase Correction. Column labels: Encoded Ancillae; Verification Qubits.]

Figure 11: An ancilla factory for the circuit in Figure 4c. Each
row of gates generates and verifies one of the three encoded zero
ancillae, then bit and phase correction are performed.
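As a worked instance of the m × n_q data area, the Python sketch below uses the qubit count stated in Section 3 for an n-bit Quantum Ripple-Carry Adder (two n-bit inputs plus n + 1 data ancillae) with the seven-qubit [[7,1,3]] code. The function names are illustrative, and the result counts data macroblocks only, with no interconnect or ancilla factories.

```python
def qrca_data_qubits(n_bits):
    """Encoded data qubits in an n-bit ripple-carry adder:
    two n-bit inputs plus n + 1 data ancillae."""
    return 2 * n_bits + (n_bits + 1)

def data_area_macroblocks(n_data_qubits, physical_per_encoded=7):
    """Data area m * n_q, one macroblock per physical qubit."""
    return physical_per_encoded * n_data_qubits

if __name__ == "__main__":
    n_q = qrca_data_qubits(32)
    print(n_q, data_area_macroblocks(n_q))
```

For the 32-bit adder this gives 97 encoded data qubits and a data area of 679 macroblocks.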



4.3 Simple Ancilla Factories 

We now focus on designing an ancilla factory, a concept
first proposed in [21]. An ancilla factory is a portion of
the layout which consumes stateless physical ancillae and
produces a steady stream of encoded ancillae at some rate.
Figure 11 shows a simple ancilla factory to execute the
circuit in Figure 4c. Each row of gates has room for ten
physical qubits, seven to be encoded and three for verification.
The adjacent rows are used for communication.
When all three are encoded and verified, the middle one
is bit-corrected by the top one and phase-corrected by the
bottom one. Using a hand-optimized schedule, the total
latency of a single ancilla preparation is approximately:

$$t_{prep} + 2\,t_{meas} + 6\,t_{2q} + 2\,t_{1q} + 8\,t_{turn} + 30\,t_{move}$$

Substituting in the ion trap latencies in Tables 1 and 4,
the layout in Figure 11 has a total latency of 323 μs
with a throughput of 3.1 encoded ancillae per millisecond
and an area of 90 macroblocks. Using this simple ancilla
factory, we could produce any desired bandwidth of encoded
ancillae by replicating the layout as many times as
necessary. Unfortunately this design is inefficient in that
the verification qubits needlessly take up space during the
seven-qubit zero encoding procedure. To combat this inefficiency
we instead consider a pipelined approach.
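The replication argument for the simple factory can be turned into a quick estimate. The Python sketch below takes the quoted 323 μs latency and 90-macroblock area, assumes one encoded ancilla per factory per latency period (which reproduces the quoted 3.1 ancillae per millisecond), and computes how many copies would cover the average QEC bandwidths of Table 3. The function names are illustrative, and the estimate ignores peak demand, π/8 ancillae and distribution.

```python
import math

SIMPLE_FACTORY_LATENCY_US = 323   # quoted total latency
SIMPLE_FACTORY_AREA = 90          # quoted area in macroblocks
T_MOVE_US, T_TURN_US = 1, 10      # Table 4

# Average QEC bandwidths from Table 3 (encoded ancillae per millisecond).
QEC_BANDWIDTH = {"32-Bit QRCA": 34.8, "32-Bit QCLA": 306.1, "32-Bit QFT": 36.8}

def movement_latency_us(turns=8, moves=30):
    """Movement share of the latency formula: 8 turns plus 30 straight moves."""
    return turns * T_TURN_US + moves * T_MOVE_US

def throughput_per_ms(latency_us=SIMPLE_FACTORY_LATENCY_US):
    """Assumes one ancilla in flight per factory at a time."""
    return 1000.0 / latency_us

def replicas_needed(target_per_ms, latency_us=SIMPLE_FACTORY_LATENCY_US):
    """Number of factory copies needed to reach the target bandwidth."""
    return math.ceil(target_per_ms / throughput_per_ms(latency_us))

if __name__ == "__main__":
    print(movement_latency_us(), round(throughput_per_ms(), 1))
    for name, bw in QEC_BANDWIDTH.items():
        n = replicas_needed(bw)
        print(name, n, n * SIMPLE_FACTORY_AREA)
```

Of the 323 μs, 110 μs is movement under the Table 4 latencies; the remainder comes from the gate, measurement and preparation terms of Table 1.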
Figure 13: A layout of each unit in Figure 12.



4.4 Pipelined Ancilla Factories 



Classically, pipelining a circuit is done by inserting syn- 
chronization points (registers) into the circuit's datapath 
to enable logic reuse, thereby increasing throughput with 
a small increase in latency. We can apply a similar tech- 
nique to our ancilla factory layout in an effort to im- 
prove area utilization. Due to the precise electrode and 
laser pulse sequences needed to implement movement and 
gates, ion trap layouts are by definition synchronous with- 
out additional synchronization elements. Instead, we must 
add a set of communication channels between pipeline 
stages allowing qubit movement for maximum gate loca- 
tion occupancy. 





Functional Unit      Unit Count   Total Height   Total Area
Zero Prepare         24           24             24
CX Stage             1            4              28
Cat State Prepare    1            2              6
Verification         3            30             30
B/P Correction       2            42             42

Table 6: The functional unit counts and stage characteristics for
the encoded zero ancilla factory in Figure 12. The CX and Cat
Prepare units in Stage 2 are bandwidth matched to a ratio of 7
to 3 (which is appropriate for verification), and then the other
stages are matched to this.



4.4.1 Encoded Zero Ancilla Factory 

We consider the entire circuit for fault tolerant encoded
zero ancilla creation (Figure 4c). Figure 12 shows a fully
pipelined microarchitecture for this circuit, which consists
of four stages. Each stage contains a number of functional
units for its subcircuit such that the output bandwidth of
one stage is matched to the input bandwidth of the next.
Adjacent stages are separated by a crossbar (Figure 13a),
which consists of two vertical columns, fully connected
horizontally, one for upwards movement, the other for
downwards.

Stage 1 consists of preparing a junk physical qubit into
the zero state with an optional Hadamard gate at a single
gate location (Figure 13b). Even though only some of
these qubits need the Hadamard, we group them all into
the same set of functional units.

Stage 2 consists of two types of units. Looking at the
CX portion of the ancilla prepare circuit in Figure 3b, we
see that the first three CX's can be performed in parallel,
as can the next three, followed by the final three. Thus,
we may use the pipelined layout in Figure 13c for this
functional unit, with three sets of qubits (each performing
three CX's with one idle qubit) in this functional unit at a
time. The Cat Prep units (Figure 13d) create a three-qubit
cat state out of three physical zero ancillae by performing
two CX's in succession.

Verification of the encoded zero ancillae using the cat
states is performed in Stage 3 and involves performing
three CX's in parallel and then measuring the cat state
qubits to determine success or failure of the encoded ancilla.
Since the encoded ancilla qubits must wait for the
measurement to complete, we need 10 macroblocks, one
for each qubit, as shown in Figure 13e. When this is done,
the three qubits of the cat state are recycled immediately,
as well as the other seven qubits if the verification failed.

Finally, in Stage 4, a verified encoded zero ancilla A is
first bit-corrected by a verified encoded zero ancilla B and
then phase-corrected by a verified encoded zero ancilla C.
Since we need storage room for A plus room to measure
both B and C in parallel (allowing us to overlap these
measurements in time), each such functional unit needs space
for three encoded ancillae, as shown in Figure 13f.

Table 5 summarizes the latency breakdown for each
stage of the pipeline and provides numerical values for
various characteristics of each functional unit under our
ion trap assumptions. Note that Stages 3 and 4 have input
bandwidth different from output bandwidth due to the fact
that some qubits are used up and recycled in these stages.
To achieve high resource utilization, we determine unit
count by matching bandwidth between successive stages.
The results are shown in Table 6.

For the crossbars, we use a two-column design, one column
for upwards movement, the other for downwards, in
order to avoid congestion. However, physical qubits
exiting Stage 1 are funneled inward to the much smaller
Stage 2, so we use a single-column crossbar there, since
bidirectionality is likely unnecessary. The total crossbar
area is thus 24 + 2 * 30 + 2 * 42 = 168 macroblocks, and
the total functional unit area is 24 + 34 + 30 + 42 = 130
macroblocks, resulting in a total area of 298 macroblocks.
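
The area totals above can be re-derived with a few lines of arithmetic. The sketch below only restates the numbers quoted in the text; the variable names are illustrative and are not taken from the paper.

```python
# Crossbar terms as written in the text: one single-column crossbar of
# height 24 and two two-column crossbars of heights 30 and 42.
crossbars = [(1, 24), (2, 30), (2, 42)]   # (columns, height in macroblocks)

# Functional unit area per pipeline stage, in macroblocks.
functional_units = [24, 34, 30, 42]

crossbar_area = sum(columns * height for columns, height in crossbars)
unit_area = sum(functional_units)
total_area = crossbar_area + unit_area

print(crossbar_area, unit_area, total_area)
```

The crossbars account for more than half of the factory area, which is why the single-column choice after Stage 1 matters.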

For overall throughput, we take the minimum throughput
among the stages. The bottleneck in the factory is the
CX Stage. Each seven physical qubits out of this stage
correspond to an encoded zero ancilla. Approximately
99.8% of these qubits are successfully verified (using the
results of our Monte Carlo simulations mentioned in
Section 2.3), and two-thirds of them are then used to correct
the other third. Thus, the overall throughput of our zero
ancilla factory is: (221.1 / 7) × 0.998 × (1/3) = 10.5 encoded
ancillae / ms.
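
As a quick check of this throughput figure, the calculation can be written out step by step. This is a sketch using only the numbers quoted above; the variable names are our own.

```python
cx_stage_bw = 221.1        # physical qubits/ms leaving the CX Stage (Table 5)
qubits_per_ancilla = 7     # seven physical qubits per encoded zero ancilla
verify_success = 0.998     # fraction passing verification
kept_fraction = 1 / 3      # two-thirds are consumed correcting the other third

encoded_per_ms = cx_stage_bw / qubits_per_ancilla      # unverified encoded ancillae/ms
throughput = encoded_per_ms * verify_success * kept_fraction

print(round(throughput, 1))
```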



4.4.2 Encoded π/8 Ancilla Factory



In Section 3.3 we showed that a non-trivial supply of
encoded π/8 ancillae is also needed by our circuits. The
circuit in Figure^ shows how to turn a zero ancilla
generated by our pipelined ancilla factories into an encoded π/8
ancilla. This circuit may be divided into four stages: 1)
Cat State Prepare, 2) Transversal Controlled-Z/S/X, plus
Transversal π/8, 3) Decode, 4) One-qubit H, One-qubit
Measure, Transversal Z conditional on measurement.

Table 7 shows the characteristics of each of these
stages. Note that bandwidths here are in physical qubits,
which is why Stages 1 and 3 have differing bandwidths
despite having the same latency. We now match
bandwidths just as we did for the zero ancilla factory in order
to get close to full utilization. Table 8 shows the final
unit counts of our π/8 ancilla factory. Note that only half
the qubits consumed by Stage 2 come from Stage 1 (the
others come from an encoded zero ancilla factory).
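
The remark about Stages 1 and 3 can be illustrated by dividing physical qubit counts by the stage latency. In the sketch below, the per-unit qubit counts (7 out of Cat State Prepare, 14 into Decode, 8 out of it) are assumptions chosen because they reproduce the bandwidths in Table 7; they are not stated explicitly in the text.

```python
def bandwidth(qubits, latency_us):
    """Physical qubits per millisecond for a unit handling `qubits` every `latency_us`."""
    return qubits / (latency_us / 1000.0)

LATENCY_US = 218  # Cat State Prepare and Decode share this latency (Table 7)

cat_out = bandwidth(7, LATENCY_US)      # assumed: one seven-qubit cat state per pass
decode_in = bandwidth(14, LATENCY_US)   # assumed: cat state plus encoded ancilla
decode_out = bandwidth(8, LATENCY_US)   # assumed: eight qubits continue to Stage 4

print(round(cat_out, 1), round(decode_in, 1), round(decode_out, 1))
```

Equal latency therefore does not imply equal bandwidth: the stages simply move different numbers of physical qubits per pass.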

The total stage heights are different enough that an 
exact layout would likely require partially folding some 
stages into others and simulating execution to determine 
exact crossbar sizes needed to avoid congestion. For our 







Functional Unit   Symbolic Latency                              Latency  Stages  BW In        BW Out       Area
                                                                (µs)             (qubits/ms)  (qubits/ms)
Zero Prep         t_prep + t_1q + 2 × t_turn + t_move           73       1       13.7         13.7         1
CX Stage          3 × t_2q + 6 × t_turn + 5 × t_move            95       3       221.1        221.1        28
Cat State Prep    2 × t_2q + 4 × t_turn + 2 × t_move            62       2       96.8         96.8         6
Verification      t_meas + t_2q + 2 × t_turn + 2 × t_move       82       1       122.0        85.2         10
B/P Correction    t_meas + 2 × t_2q + 6 × t_turn + 8 × t_move   138      1       152.2        50.7         21

Table 5: For each functional unit in Figure 12, Column 2 gives its symbolic latency. The remaining columns give numeric values
using our ion trap assumptions. "Stages" is the number of pipeline stages within the functional unit itself, and "Area" is given in
number of macroblocks.
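
The numeric columns of Table 5 follow from the symbolic latencies once per-operation latencies are fixed. The sketch below assumes t_prep = 51, t_1q = 1, t_2q = 10, t_meas = 50, t_turn = 10 and t_move = 1 (all in µs); these values are an assumption, chosen because they reproduce every numeric latency in the table, and the ion trap assumptions stated earlier in the paper remain authoritative. The per-unit input qubit counts (1, 7, 3, 10, 21) are taken from the stage descriptions.

```python
# Assumed per-operation latencies in microseconds (see lead-in).
T = {"prep": 51, "1q": 1, "2q": 10, "meas": 50, "turn": 10, "move": 1}

# name: (coefficients of the symbolic latency, pipeline stages, qubits in per batch)
UNITS = {
    "Zero Prep":      ({"prep": 1, "1q": 1, "turn": 2, "move": 1}, 1, 1),
    "CX Stage":       ({"2q": 3, "turn": 6, "move": 5}, 3, 7),
    "Cat State Prep": ({"2q": 2, "turn": 4, "move": 2}, 2, 3),
    "Verification":   ({"meas": 1, "2q": 1, "turn": 2, "move": 2}, 1, 10),
    "B/P Correction": ({"meas": 1, "2q": 2, "turn": 6, "move": 8}, 1, 21),
}

def latency_us(coeffs):
    """Evaluate a symbolic latency such as 3 x t_2q + 6 x t_turn + 5 x t_move."""
    return sum(count * T[op] for op, count in coeffs.items())

def input_bandwidth(coeffs, stages, qubits_in):
    """Qubits/ms: a unit with `stages` pipeline stages accepts a batch every latency/stages."""
    return qubits_in * stages / (latency_us(coeffs) / 1000.0)

for name, (coeffs, stages, qubits_in) in UNITS.items():
    bw = input_bandwidth(coeffs, stages, qubits_in)
    print(f"{name:15s} {latency_us(coeffs):4d} us  {bw:6.1f} qubits/ms")
```

Internal pipelining is what lets the CX Stage reach the highest input bandwidth even though its latency is longer than that of Zero Prep.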



Stage                       Symbolic Latency                              Latency  In BW        Out BW       Area
                                                                          (µs)     (qubits/ms)  (qubits/ms)
Cat State Prepare           7 × t_2q + 14 × t_turn + 8 × t_move           218      32.1         32.1         12
Transversal CX/CS/CZ/π/8    3 × t_2q + 2 × t_turn + 3 × t_move            53       264.2        264.2        7
Decode (plus Store)         7 × t_2q + 14 × t_turn + 8 × t_move           218      64.2         36.7         19
H/M/Transversal Z           t_meas + 2 × t_1q + 2 × t_turn + 2 × t_move   74       108.1        94.6

Table 7: For each stage in the encoded π/8 ancilla generation circuit, we give its symbolic latency, plus numeric values for various
characteristics of the stage under our ion trap assumptions.





Stage                       Unit Count  Total Height  Total Area
Cat State Prepare           4           24            48
Transversal CX/CS/CZ/π/8    1           7             7
Decode (plus Store)         4           52            76
H/M/Transversal Z           2           16            16

Table 8: The functional unit counts and characteristics for each
stage of our final π/8 ancilla factory.
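
One way to sanity-check the bandwidth matching is to combine the per-unit bandwidths of Table 7 with the unit counts of Table 8 and compare demand against capacity at each stage. This is a sketch under simplifying assumptions (Stage 2 forwards every qubit it receives, and Decode runs at full output), not the procedure used to derive the tables.

```python
# Per-unit (input, output) bandwidth in physical qubits/ms, from Table 7.
PER_UNIT = {
    "cat": (32.1, 32.1),
    "transversal": (264.2, 264.2),
    "decode": (64.2, 36.7),
    "hmz": (108.1, 94.6),
}
COUNT = {"cat": 4, "transversal": 1, "decode": 4, "hmz": 2}  # Table 8

cat_flow = COUNT["cat"] * PER_UNIT["cat"][1]           # cat-state qubits/ms
transversal_demand = 2 * cat_flow                      # other half comes from zero ancilla factories
decode_demand = transversal_demand                     # assumption: Stage 2 forwards everything
hmz_demand = COUNT["decode"] * PER_UNIT["decode"][1]   # assumption: Decode at full output

def capacity(stage):
    return COUNT[stage] * PER_UNIT[stage][0]

utilization = {
    "transversal": transversal_demand / capacity("transversal"),
    "decode": decode_demand / capacity("decode"),
    "hmz": hmz_demand / capacity("hmz"),
}

# One encoded pi/8 ancilla per seven-qubit cat state.
factory_throughput = cat_flow / 7

print({stage: round(u, 2) for stage, u in utilization.items()}, round(factory_throughput, 1))
```

Under these assumptions no downstream stage is oversubscribed, so the four Cat State Prepare units set the pace of the whole factory.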



purposes, we will allocate two columns to each crossbar,
since qubits must be able to move in both directions at the
same time. Thus, the total crossbar area is 2 * 24 + 2 *
52 + 2 * 52 = 256 macroblocks, and the total functional
unit area is 48 + 7 + 76 + 16 = 147 macroblocks, resulting
in a total area of 403 macroblocks. Note that this is only
the area for turning an encoded zero into an encoded π/8.
This factory needs to be supplied by zero ancilla factories
in order to function, which we account for in Section

The bottleneck of this ancilla factory is the Cat State
Prepare stage. Each seven-qubit cat state produced by this
stage results in one encoded π/8 ancilla produced by the
factory, so the throughput of the factory is equal to the
throughput of this stage: 18.3 encoded π/8 ancillae / ms.

As mentioned in Section 2.5, we build up smaller angle
π/2^k rotations from combinations of π/8 and H gates
instead of building ancilla factories for them. It is
worthwhile to note that if physical gates with adequate
precision are available, the critical path for the data can be
decreased further. From Figure 6 we see that the critical
path for the data through such a factory would on average
consist of Σ_i 1/2^i CX gates and one fewer X gates.



5 Architectural Trade-offs

We now bring our analyses together to draw quantitative
conclusions about running a quantum circuit at the
speed of data and to compare against proposed architectures
from prior work. Following that, we present a more
qualitative discussion of some conclusions we've drawn
from this work.


5.1 Matching Production to Need 

We divide the microarchitecture of a quantum layout into
three components: 1) hardware resources for generation
of encoded ancillae; 2) hardware resources for data
operations, including operations involving data ancillae and
the data/ancilla interaction portion of a QEC step; and 3)
an interconnection network for moving around both
encoded data and ancillae. Figure 14a shows the (C)QLA
microarchitecture [22, 15] using these components, with
each data qubit (whether in a compute region or memory)
having an associated ancilla generation unit for QEC.
Figure 14b shows an ancilla factory-based microarchitecture
wherein encoded ancillae are being generated across the
chip and distributed to data as need dictates.

Table 9 gives the relative areas of two of the three
components of the microarchitecture in Figure 14b when
running our benchmarks at (or near) the speed of data under
our ion trap assumptions. We depict our microarchitectural
components to scale for the 32-bit QCLA in
Figure 14. The encoded zero ancilla bandwidth for error
correction is the average bandwidth required for each circuit
(Table 3). A corresponding encoded π/8 ancilla
bandwidth is computed (but not shown in the table) to run the
circuit at that speed. Column 4 includes only those zero
Figure 14: A quantum layout microarchitecture may be considered to consist of three components: generators of encoded ancillae,
data qubit computation regions and interconnect. (a) The (C)QLA microarchitecture dedicates an ancilla generation unit to each
data qubit. (b) Our general microarchitecture redirects encoded ancillae to wherever they're needed on the chip, thus avoiding idle
generators. (c) In order to run at the speed of data, the ancilla generation portion of the chip needs far more hardware than the data
regions, as shown in Table 9.



Quantum Circuit   Encoded Ancilla      Data Area      QEC Ancilla Factories   π/8 Ancilla Factories
                  Bandwidth For QEC    (% of total)   Area (% of total)       Area (% of total)
32-Bit QRCA       34.8                 679 (33.6%)    986.9 (48.8%)           354.7 (17.6%)
32-Bit QCLA       306.1                861 (6.8%)     8682.2 (68.4%)          3154.4 (24.8%)
32-Bit QFT        36.8                 224 (13.2%)    1043.5 (61.3%)          433.7 (25.5%)

Table 9: Area breakdown to generate encoded ancillae at the QEC bandwidths shown in Table 3. The π/8 ancilla bandwidth is
computed to match. The last column includes area for both π/8 encoding and the zero ancilla factories supplying these encoders.



ancilla factories producing for QEC. Column 5 includes
both π/8 encoding factories and sufficient encoded zero
ancilla factories to supply the π/8 encoding factories.

We see that even the most serial of the benchmarks, the 
Quantum Ripple-Carry Adder, requires a substantial por- 
tion of the chip (two-thirds) dedicated to encoded ancilla 
generation in order to take this generation off the execu- 
tion's critical path, while the more parallel QCLA requires 
more than 90%. 
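
The QEC factory column of Table 9 and the area shares quoted above can be approximated from the zero ancilla factory figures in Section 4.4. The sketch below assumes that factory area scales linearly with required bandwidth (fractional factories allowed) and uses the unrounded factory throughput; both are assumptions made for illustration.

```python
ZERO_FACTORY_AREA = 298                       # macroblocks per pipelined zero ancilla factory
ZERO_FACTORY_RATE = 221.1 / 7 * 0.998 / 3     # encoded zero ancillae/ms, about 10.5

def qec_factory_area(bandwidth):
    """Macroblocks of zero ancilla factories needed to sustain `bandwidth` ancillae/ms."""
    return bandwidth / ZERO_FACTORY_RATE * ZERO_FACTORY_AREA

# circuit: (QEC bandwidth, data area, QEC factory area, pi/8 factory area), from Table 9
TABLE_9 = {
    "32-Bit QRCA": (34.8, 679, 986.9, 354.7),
    "32-Bit QCLA": (306.1, 861, 8682.2, 3154.4),
    "32-Bit QFT": (36.8, 224, 1043.5, 433.7),
}

ancilla_share = {}
for name, (bw, data, qec, pi8) in TABLE_9.items():
    total = data + qec + pi8
    ancilla_share[name] = 100 * (qec + pi8) / total
    print(f"{name}: estimated QEC area {qec_factory_area(bw):.1f} (table: {qec}), "
          f"ancilla share {ancilla_share[name]:.1f}%")
```

The linear model lands within a fraction of a percent of the tabulated QEC areas, and the ancilla shares match the two-thirds and over-90% figures cited for QRCA and QCLA.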

5.2 Latency/Area Evaluation 

The proposals for both QLA and CQLA specify space for 
only serial production of ancillae at each encoded data 
qubit location. We generalize this to GQLA and GCQLA 
in which we replicate the ancilla area at each data qubit 
to allow parallel production of ancillae. CQLA has addi- 
tional flexibility in that different numbers of data units can 
be present in the compute cache. We wish to quantify the 
efficiency of ancilla production in each microarchitecture 
by studying area needed for a given execution time. 

Methodology: Using dataflow graphs of our benchmarks
and the estimates in Tables 5-8, we implemented
an event-based simulation of ancilla factory production
and data qubit gate consumption. Simulation of the QLA
[22] microarchitecture assumes that each data qubit in the
computation has a dedicated cell with ancilla production.
Data qubits are always moved back to their home base to
do the error correction after each encoded gate. We
simulate dataflow execution taking into account latency of the
ancilla production and encoded gate execution, using
latencies from Tables 5 and 7.

CQLA [15] optimizes the QLA design by adding a
cache of data qubits that are in the current working set. To
simulate this, we added tracking of which qubits are in the
"compute cache" and account for cache miss and
write-back latencies. This was the most complicated simulation
and has an implementation similar to that of sim-cache in
SimpleScalar [23]. We used the same basic ancilla
production and data gate latencies as for QLA.
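
To make the methodology concrete, here is a minimal sketch of an event-based dataflow simulation with a single shared ancilla stream (the fully multiplexed case). It is not the simulator used for the results in this paper: the function name, the greedy first-ready allocation policy, and the example parameters (gate latency, ancillae per gate) are all hypothetical.

```python
import heapq

def simulate(num_gates, preds, ancilla_rate, gate_ms, ancillae_per_gate):
    """Finish time (ms) of a dataflow graph whose gates draw from one ancilla stream.

    preds maps a gate id to the list of gates it depends on. Ancilla number n
    (1-based) is assumed to become available at time n / ancilla_rate.
    """
    succs = {g: [] for g in range(num_gates)}
    waiting = {g: 0 for g in range(num_gates)}
    for g, ps in preds.items():
        waiting[g] = len(ps)
        for p in ps:
            succs[p].append(g)

    ready_at = {g: 0.0 for g in range(num_gates)}
    events = [(0.0, g) for g in range(num_gates) if waiting[g] == 0]
    heapq.heapify(events)

    issued = 0       # ancillae handed out so far
    finish = 0.0
    while events:
        t_ready, g = heapq.heappop(events)       # earliest-ready gate first
        issued += ancillae_per_gate
        t_ancillae = issued / ancilla_rate       # when this gate's last ancilla exists
        done = max(t_ready, t_ancillae) + gate_ms
        finish = max(finish, done)
        for s in succs[g]:
            ready_at[s] = max(ready_at[s], done)
            waiting[s] -= 1
            if waiting[s] == 0:
                heapq.heappush(events, (ready_at[s], s))
    return finish

# A four-gate serial chain with hypothetical gate latency and ancilla demand.
chain = {1: [0], 2: [1], 3: [2]}
slow = simulate(4, chain, ancilla_rate=10.5, gate_ms=0.05, ancillae_per_gate=2)
fast = simulate(4, chain, ancilla_rate=1000.0, gate_ms=0.05, ancillae_per_gate=2)
speed_of_data = simulate(4, chain, float("inf"), 0.05, 2)
print(slow, fast, speed_of_data)
```

With an unlimited ancilla supply the finish time collapses to the sum of gate latencies along the chain, which is the "speed of data" limit; a finite rate adds ancilla wait time on top of it.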

Results: Figure 15 shows overall circuit execution time
as a function of total area dedicated to ancilla factories
(of both types) for the different microarchitectures being
tested for QRCA (left), QCLA (middle) and QFT (right).
Total data qubit area is given in the caption for each.

We notice that CQLA takes about half an order to
an order of magnitude longer to execute than
Fully-Multiplexed Ancilla Distribution. This is due to
cache misses in CQLA, whereas
Fully-Multiplexed always distributes encoded ancillae to data
when necessary. CQLA also plateaus half an order to an
order of magnitude higher than Fully-Multiplexed since,
even with very fast encoded ancilla production, cache
misses are still incurred to bring ancillae to data.

QLA requires two orders of magnitude more area for 
ancilla production to match execution time with Fully- 
Figure 15: Execution time as a function of total area of encoded ancilla factories. (Left) 32-bit QRCA, Data qubit area = 679
macroblocks; (Middle) 32-bit QCLA, Data qubit area = 861 macroblocks; (Right) 32-bit QFT, Data qubit area = 224 macroblocks.



Multiplexed, which is logical since many ancilla genera- 
tors are idle much of the time in QLA when they could be 
used to feed nearby data need. On the other hand, QLA 
eventually plateaus at a similar execution time as Fully- 
Multiplexed, which makes sense since it has no concept 
of cache misses. QLA simply needs very high encoded 
ancilla production at each data qubit in order to run at the 
speed of data. 



5.3 Qalypso: Microarchitectural Implications of Pipelined Ancilla Factories

The simple encoded zero ancilla factory in Figure 11 has
an area of 90 macroblocks and a throughput of 3.1
encoded ancillae per millisecond. The pipelined encoded
zero ancilla factory designed in Section 4.4 has an area of
298 macroblocks and a throughput of 10.5 encoded
ancillae / ms. They produce virtually the same encoded zero
ancilla bandwidth per unit area, thus seemingly negating
some of the benefits of pipelining. This is a result of the
facts that the technology is inherently synchronous and
that individual gate locations are multi-purpose.

Nonetheless, we conclude that pipelined ancilla factories
provide significant benefit in having concentrated
input and output "ports." We propose Qalypso, a tiled
microarchitecture shown in Figure 16a using the tile shown
in Figure 16b, with ballistic movement being used within
a tile and teleportation of data between tiles [16]. The
central data region consists of a dense packing of
encoded data qubits and channels for local ballistic
movement. The ancilla factories each have an output port
physically near the data region so encoded ancillae do not have
far to travel. This is beneficial both in reducing aggregate
movement error on encoded ancillae and in avoiding
congestion problems from having encoded ancillae
generated uniformly throughout an ancilla factory. Meanwhile,
since the limiting factor on move speed in ion traps is state
decoherence rather than control of the electrodes, stateless
qubits may be recycled to factory input ports much more
quickly, allowing the input ports to be far from the data.
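
The bandwidth-per-area comparison between the simple and pipelined zero ancilla factories at the start of this section can be checked directly. The sketch uses only the area and throughput figures quoted there; the variable names are illustrative.

```python
# (throughput in encoded zero ancillae/ms, area in macroblocks)
simple = (3.1, 90)
pipelined = (10.5, 298)

simple_density = simple[0] / simple[1]
pipelined_density = pipelined[0] / pipelined[1]
ratio = pipelined_density / simple_density

print(f"simple: {simple_density:.4f}, pipelined: {pipelined_density:.4f}, ratio: {ratio:.3f}")
```

The two designs differ by only a few percent in ancillae per millisecond per macroblock, so the case for pipelining rests on the concentrated ports rather than on raw density.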

This architecture differs from (C)QLA in two significant
respects. First, our data regions consist of data alone.
In CQLA, the compute regions consist of both data and
ancilla generation units, meaning that data are physically
quite a bit further apart even within one compute region
and generally require teleportation for movement. Even
if QEC were performed as part of teleportation [24], this
requires twice as many encoded ancillae as a straightforward
QEC step. Thus, we suggest that our data regions
be made as large as possible to allow data qubits to reach
each other using ballistic movement instead of teleportation
as much as possible. Though ballistic movement is
somewhat error prone, the area of a data region consisting
of nothing but encoded data qubits is still quite small, so
teleportation is only necessary between data regions.

Second, ancilla factories surrounding a data region in
our design are shared by all data qubits within that region.
In Figure 14a, which represents the (C)QLA microarchitecture,
each ancilla generator is dedicated to a single data
qubit (location), so imbalances in encoded ancilla need
cause some generators to go idle while others cannot meet
need. By having a full crossbar between generators and
consumers (data qubits), as in Figure 14b, fresh ancillae
go where they are needed within a single data region.

The choice of data region size is still an open problem
and depends on the level of parallelism in the target
application. The determining factors will likely be local
movement congestion within the data region and load on the
inter-tile interconnect, which are shown as the grey boxes
in Figure 16. Analyses concerning these trade-offs will
be the subject of future research.

6 Conclusion 

We find that ancilla generation bandwidth in a quantum 
computer is the primary performance bottleneck, and we 
present a microarchitecture that takes this bottleneck off 
Figure 16: (a) Qalypso: our proposed microarchitecture. (b) A single tile consists of a dense data region surrounded by ancilla
factories funneling encoded ancillae as need arises. Ancilla distribution is fully multiplexed within each tile, with factory output

the critical path. We examine two major consumers of
ancillae: quantum error correction (QEC) and non-transversal
quantum gates, such as the π/8 gate for the [[7,1,3]] CSS
code. We characterize our benchmarks to find bandwidth
needs ranging from 30 to 300 encoded zero ancillae / ms
and from 7 to 60 encoded π/8 ancillae / ms for
ion trap quantum computers.

Our resulting microarchitecture, Qalypso, is optimized 
for ancilla generation and distribution, featuring dense 
data-only regions fed by nearby ancilla factories. We 
present layouts for these ancilla factories and show that 
ancilla generation takes the majority of the chip area even 
in the most serial of our circuits, the ripple-carry adder. As 
an interesting aside, we find that pipelining does not have 
the same beneficial impact on throughput as in classical 
circuits but does provide an important structural benefit: 
it can produce high bandwidth ancillae directed at a single 
output port. Qalypso can produce circuits of similar speed 
to previous architectures with greatly reduced resources or 
alternatively can produce circuits of much greater speed 
than previous architectures for similar area. 